Multi-Class Model for Sign Language MNIST Using Python and XGBoost

David Lowe

November 17, 2020

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Sign Language MNIST dataset is a multi-class classification situation where we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods but researchers have renewed efforts to update it and develop drop-in replacements that are more challenging for computer vision and original for real-world applications. To stimulate the community to develop more drop-in replacements, the Sign Language MNIST is presented here and follows the same CSV format with labels and pixel values in single rows. The American Sign Language letter database of hand gestures represent a multi-class problem with 24 classes of letters (excluding J and Z which require motion).

The dataset format is patterned to match closely with the classic MNIST. Each training and test case represents a label (0-25) as a one-to-one map for each alphabetic letter A-Z (and no cases for 9=J or 25=Z because of gesture motions). The training data (27,455 cases) and test data (7172 cases) are approximately half the size of the standard MNIST but otherwise similar with a header row of label, pixel1,pixel2….pixel784 which represent a single 28x28 pixel image with grayscale values between 0-255. The original hand gesture image data represented multiple users repeating the gesture against different backgrounds.

ANALYSIS: The performance of the preliminary XGBoost model achieved an accuracy benchmark of 95.41%. After a series of tuning trials, the best XGBoost model processed the training dataset with an accuracy score of 99.68%. When we applied the final model to the previously unseen test dataset, we obtained an accuracy score of 78.93%, which pointed to a high variance error.

CONCLUSION: In this iteration, XGBoost appeared to be a suitable algorithm for modeling this dataset. We should consider experimenting with XGBoost for further modeling.

Dataset Used: Sign Language MNIST Data Set

Dataset ML Model: Multi-class with numerical attributes

Dataset Reference: https://www.kaggle.com/datamunge/sign-language-mnist

One source of potential performance benchmarks: https://www.kaggle.com/datamunge/sign-language-mnist

Any predictive modeling machine learning project generally can be broken down into about six major tasks:

  1. Prepare Environment
  2. Summarize and Visualize Data
  3. Pre-process Data
  4. Train and Evaluate Models
  5. Fine-tune and Improve Models
  6. Finalize Model and Present Analysis

Task 1 - Prepare Environment

In [1]:
# Install the necessary packages for Colab
!pip install python-dotenv PyMySQL
Collecting python-dotenv
  Downloading https://files.pythonhosted.org/packages/32/2e/e4585559237787966aad0f8fd0fc31df1c4c9eb0e62de458c5b6cde954eb/python_dotenv-0.15.0-py2.py3-none-any.whl
Collecting PyMySQL
  Downloading https://files.pythonhosted.org/packages/1a/ea/dd9c81e2d85efd03cfbf808736dd055bd9ea1a78aea9968888b1055c3263/PyMySQL-0.10.1-py2.py3-none-any.whl (47kB)
     |████████████████████████████████| 51kB 3.4MB/s 
Installing collected packages: python-dotenv, PyMySQL
Successfully installed PyMySQL-0.10.1 python-dotenv-0.15.0
In [2]:
# Retrieve the GPU information from Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Mon Nov  9 17:18:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
In [3]:
# Retrieve the memory configuration from Colab
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
    print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
    print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
    print('re-execute this cell.')
else:
    print('You are using a high-RAM runtime!')
Your runtime has 13.7 gigabytes of available RAM

To enable a high-RAM runtime, select the Runtime → "Change runtime type"
menu, and then select High-RAM in the Runtime shape dropdown. Then, 
re-execute this cell.
In [4]:
# Retrieve the CPU information
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])
The number of available CPUs is: 2

1.a) Load libraries and modules

In [5]:
# Set the random seed number for reproducible results
seedNum = 888
In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
# import boto3
from datetime import datetime
from dotenv import load_dotenv
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.feature_selection import RFE
# from imblearn.pipeline import Pipeline
# from imblearn.over_sampling import SMOTE
# from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier

1.b) Set up the controlling parameters and functions

In [7]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the number of CPU cores available for multi-thread processing
n_jobs = 1

# Set up the flag to stop sending progress emails (setting to True will send status emails!)
notifyStatus = False

# Set Pandas options
pd.set_option("display.max_rows", 500)
pd.set_option("display.width", 140)

# Set the percentage sizes for splitting the dataset
test_set_size = 0.2
val_set_size = 0.25

# Set the number of folds for cross validation
n_folds = 5

# Set various default modeling parameters
scoring = 'accuracy'

# Set the number of classes for XGBoost
n_classes = 3
In [8]:
# Set up the parent directory location for loading the dotenv files
# useColab = True
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the dotenv file for retrieving environment variables
# useLocalPC = True
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)
In [9]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
In [10]:
if notifyStatus: status_notify("Task 1 - Prepare Environment has begun! " + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

1.c) Load dataset

In [11]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_train.csv'
Xy_original = pd.read_csv(dataset_path, sep=',')

# Take a peek at the dataframe after import
Xy_original.head()
Out[11]:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19 pixel20 pixel21 pixel22 pixel23 pixel24 pixel25 pixel26 pixel27 pixel28 pixel29 pixel30 pixel31 pixel32 pixel33 pixel34 pixel35 pixel36 pixel37 pixel38 pixel39 ... pixel745 pixel746 pixel747 pixel748 pixel749 pixel750 pixel751 pixel752 pixel753 pixel754 pixel755 pixel756 pixel757 pixel758 pixel759 pixel760 pixel761 pixel762 pixel763 pixel764 pixel765 pixel766 pixel767 pixel768 pixel769 pixel770 pixel771 pixel772 pixel773 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 3 107 118 127 134 139 143 146 150 153 156 158 160 163 165 159 166 168 170 170 171 171 171 172 171 171 170 170 169 111 121 129 135 141 144 148 151 154 157 160 ... 205 206 206 207 207 206 206 204 205 204 203 202 142 151 160 172 196 188 188 190 135 96 86 77 77 79 176 205 207 207 207 207 207 207 206 206 206 204 203 202
1 6 155 157 156 156 156 157 156 158 158 157 158 156 154 154 153 152 151 149 149 148 147 146 144 142 143 138 92 108 158 159 159 159 160 160 160 160 160 160 160 ... 100 78 120 157 168 107 99 121 133 97 95 120 135 116 95 79 69 86 139 173 200 185 175 198 124 118 94 140 133 84 69 149 128 87 94 163 175 103 135 149
2 2 187 188 188 187 187 186 187 188 187 186 185 185 185 184 184 184 181 181 179 179 179 178 178 109 52 66 77 83 188 189 189 188 188 189 188 188 188 188 187 ... 203 204 203 201 200 200 199 198 196 195 194 193 198 166 132 114 89 74 79 77 74 78 132 188 210 209 206 205 204 203 202 201 200 199 198 199 198 195 194 195
3 2 211 211 212 212 211 210 211 210 210 211 209 207 208 207 206 203 202 201 200 198 197 195 192 197 171 51 52 54 212 213 215 215 212 212 213 212 212 211 211 ... 247 242 233 231 230 229 227 225 223 221 220 216 58 51 49 50 57 60 17 15 18 17 19 1 159 255 237 239 237 236 235 234 233 231 230 226 225 222 229 163
4 13 164 167 170 172 176 179 180 184 185 186 188 189 189 190 191 189 190 190 187 190 192 193 191 191 192 192 194 194 166 169 172 174 177 180 182 185 186 187 190 ... 90 77 88 117 123 127 129 134 145 152 156 179 105 106 105 104 104 104 175 199 178 152 136 130 136 150 118 92 85 76 92 105 105 108 133 163 157 163 164 179

5 rows × 785 columns

In [12]:
Xy_original.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27455 entries, 0 to 27454
Data columns (total 785 columns):
 #   Column    Dtype
---  ------    -----
 0   label     int64
 1   pixel1    int64
 2   pixel2    int64
 3   pixel3    int64
 4   pixel4    int64
 5   pixel5    int64
 6   pixel6    int64
 7   pixel7    int64
 8   pixel8    int64
 9   pixel9    int64
 10  pixel10   int64
 11  pixel11   int64
 12  pixel12   int64
 13  pixel13   int64
 14  pixel14   int64
 15  pixel15   int64
 16  pixel16   int64
 17  pixel17   int64
 18  pixel18   int64
 19  pixel19   int64
 20  pixel20   int64
 21  pixel21   int64
 22  pixel22   int64
 23  pixel23   int64
 24  pixel24   int64
 25  pixel25   int64
 26  pixel26   int64
 27  pixel27   int64
 28  pixel28   int64
 29  pixel29   int64
 30  pixel30   int64
 31  pixel31   int64
 32  pixel32   int64
 33  pixel33   int64
 34  pixel34   int64
 35  pixel35   int64
 36  pixel36   int64
 37  pixel37   int64
 38  pixel38   int64
 39  pixel39   int64
 40  pixel40   int64
 41  pixel41   int64
 42  pixel42   int64
 43  pixel43   int64
 44  pixel44   int64
 45  pixel45   int64
 46  pixel46   int64
 47  pixel47   int64
 48  pixel48   int64
 49  pixel49   int64
 50  pixel50   int64
 51  pixel51   int64
 52  pixel52   int64
 53  pixel53   int64
 54  pixel54   int64
 55  pixel55   int64
 56  pixel56   int64
 57  pixel57   int64
 58  pixel58   int64
 59  pixel59   int64
 60  pixel60   int64
 61  pixel61   int64
 62  pixel62   int64
 63  pixel63   int64
 64  pixel64   int64
 65  pixel65   int64
 66  pixel66   int64
 67  pixel67   int64
 68  pixel68   int64
 69  pixel69   int64
 70  pixel70   int64
 71  pixel71   int64
 72  pixel72   int64
 73  pixel73   int64
 74  pixel74   int64
 75  pixel75   int64
 76  pixel76   int64
 77  pixel77   int64
 78  pixel78   int64
 79  pixel79   int64
 80  pixel80   int64
 81  pixel81   int64
 82  pixel82   int64
 83  pixel83   int64
 84  pixel84   int64
 85  pixel85   int64
 86  pixel86   int64
 87  pixel87   int64
 88  pixel88   int64
 89  pixel89   int64
 90  pixel90   int64
 91  pixel91   int64
 92  pixel92   int64
 93  pixel93   int64
 94  pixel94   int64
 95  pixel95   int64
 96  pixel96   int64
 97  pixel97   int64
 98  pixel98   int64
 99  pixel99   int64
 100 pixel100  int64
 101 pixel101  int64
 102 pixel102  int64
 103 pixel103  int64
 104 pixel104  int64
 105 pixel105  int64
 106 pixel106  int64
 107 pixel107  int64
 108 pixel108  int64
 109 pixel109  int64
 110 pixel110  int64
 111 pixel111  int64
 112 pixel112  int64
 113 pixel113  int64
 114 pixel114  int64
 115 pixel115  int64
 116 pixel116  int64
 117 pixel117  int64
 118 pixel118  int64
 119 pixel119  int64
 120 pixel120  int64
 121 pixel121  int64
 122 pixel122  int64
 123 pixel123  int64
 124 pixel124  int64
 125 pixel125  int64
 126 pixel126  int64
 127 pixel127  int64
 128 pixel128  int64
 129 pixel129  int64
 130 pixel130  int64
 131 pixel131  int64
 132 pixel132  int64
 133 pixel133  int64
 134 pixel134  int64
 135 pixel135  int64
 136 pixel136  int64
 137 pixel137  int64
 138 pixel138  int64
 139 pixel139  int64
 140 pixel140  int64
 141 pixel141  int64
 142 pixel142  int64
 143 pixel143  int64
 144 pixel144  int64
 145 pixel145  int64
 146 pixel146  int64
 147 pixel147  int64
 148 pixel148  int64
 149 pixel149  int64
 150 pixel150  int64
 151 pixel151  int64
 152 pixel152  int64
 153 pixel153  int64
 154 pixel154  int64
 155 pixel155  int64
 156 pixel156  int64
 157 pixel157  int64
 158 pixel158  int64
 159 pixel159  int64
 160 pixel160  int64
 161 pixel161  int64
 162 pixel162  int64
 163 pixel163  int64
 164 pixel164  int64
 165 pixel165  int64
 166 pixel166  int64
 167 pixel167  int64
 168 pixel168  int64
 169 pixel169  int64
 170 pixel170  int64
 171 pixel171  int64
 172 pixel172  int64
 173 pixel173  int64
 174 pixel174  int64
 175 pixel175  int64
 176 pixel176  int64
 177 pixel177  int64
 178 pixel178  int64
 179 pixel179  int64
 180 pixel180  int64
 181 pixel181  int64
 182 pixel182  int64
 183 pixel183  int64
 184 pixel184  int64
 185 pixel185  int64
 186 pixel186  int64
 187 pixel187  int64
 188 pixel188  int64
 189 pixel189  int64
 190 pixel190  int64
 191 pixel191  int64
 192 pixel192  int64
 193 pixel193  int64
 194 pixel194  int64
 195 pixel195  int64
 196 pixel196  int64
 197 pixel197  int64
 198 pixel198  int64
 199 pixel199  int64
 200 pixel200  int64
 201 pixel201  int64
 202 pixel202  int64
 203 pixel203  int64
 204 pixel204  int64
 205 pixel205  int64
 206 pixel206  int64
 207 pixel207  int64
 208 pixel208  int64
 209 pixel209  int64
 210 pixel210  int64
 211 pixel211  int64
 212 pixel212  int64
 213 pixel213  int64
 214 pixel214  int64
 215 pixel215  int64
 216 pixel216  int64
 217 pixel217  int64
 218 pixel218  int64
 219 pixel219  int64
 220 pixel220  int64
 221 pixel221  int64
 222 pixel222  int64
 223 pixel223  int64
 224 pixel224  int64
 225 pixel225  int64
 226 pixel226  int64
 227 pixel227  int64
 228 pixel228  int64
 229 pixel229  int64
 230 pixel230  int64
 231 pixel231  int64
 232 pixel232  int64
 233 pixel233  int64
 234 pixel234  int64
 235 pixel235  int64
 236 pixel236  int64
 237 pixel237  int64
 238 pixel238  int64
 239 pixel239  int64
 240 pixel240  int64
 241 pixel241  int64
 242 pixel242  int64
 243 pixel243  int64
 244 pixel244  int64
 245 pixel245  int64
 246 pixel246  int64
 247 pixel247  int64
 248 pixel248  int64
 249 pixel249  int64
 250 pixel250  int64
 251 pixel251  int64
 252 pixel252  int64
 253 pixel253  int64
 254 pixel254  int64
 255 pixel255  int64
 256 pixel256  int64
 257 pixel257  int64
 258 pixel258  int64
 259 pixel259  int64
 260 pixel260  int64
 261 pixel261  int64
 262 pixel262  int64
 263 pixel263  int64
 264 pixel264  int64
 265 pixel265  int64
 266 pixel266  int64
 267 pixel267  int64
 268 pixel268  int64
 269 pixel269  int64
 270 pixel270  int64
 271 pixel271  int64
 272 pixel272  int64
 273 pixel273  int64
 274 pixel274  int64
 275 pixel275  int64
 276 pixel276  int64
 277 pixel277  int64
 278 pixel278  int64
 279 pixel279  int64
 280 pixel280  int64
 281 pixel281  int64
 282 pixel282  int64
 283 pixel283  int64
 284 pixel284  int64
 285 pixel285  int64
 286 pixel286  int64
 287 pixel287  int64
 288 pixel288  int64
 289 pixel289  int64
 290 pixel290  int64
 291 pixel291  int64
 292 pixel292  int64
 293 pixel293  int64
 294 pixel294  int64
 295 pixel295  int64
 296 pixel296  int64
 297 pixel297  int64
 298 pixel298  int64
 299 pixel299  int64
 300 pixel300  int64
 301 pixel301  int64
 302 pixel302  int64
 303 pixel303  int64
 304 pixel304  int64
 305 pixel305  int64
 306 pixel306  int64
 307 pixel307  int64
 308 pixel308  int64
 309 pixel309  int64
 310 pixel310  int64
 311 pixel311  int64
 312 pixel312  int64
 313 pixel313  int64
 314 pixel314  int64
 315 pixel315  int64
 316 pixel316  int64
 317 pixel317  int64
 318 pixel318  int64
 319 pixel319  int64
 320 pixel320  int64
 321 pixel321  int64
 322 pixel322  int64
 323 pixel323  int64
 324 pixel324  int64
 325 pixel325  int64
 326 pixel326  int64
 327 pixel327  int64
 328 pixel328  int64
 329 pixel329  int64
 330 pixel330  int64
 331 pixel331  int64
 332 pixel332  int64
 333 pixel333  int64
 334 pixel334  int64
 335 pixel335  int64
 336 pixel336  int64
 337 pixel337  int64
 338 pixel338  int64
 339 pixel339  int64
 340 pixel340  int64
 341 pixel341  int64
 342 pixel342  int64
 343 pixel343  int64
 344 pixel344  int64
 345 pixel345  int64
 346 pixel346  int64
 347 pixel347  int64
 348 pixel348  int64
 349 pixel349  int64
 350 pixel350  int64
 351 pixel351  int64
 352 pixel352  int64
 353 pixel353  int64
 354 pixel354  int64
 355 pixel355  int64
 356 pixel356  int64
 357 pixel357  int64
 358 pixel358  int64
 359 pixel359  int64
 360 pixel360  int64
 361 pixel361  int64
 362 pixel362  int64
 363 pixel363  int64
 364 pixel364  int64
 365 pixel365  int64
 366 pixel366  int64
 367 pixel367  int64
 368 pixel368  int64
 369 pixel369  int64
 370 pixel370  int64
 371 pixel371  int64
 372 pixel372  int64
 373 pixel373  int64
 374 pixel374  int64
 375 pixel375  int64
 376 pixel376  int64
 377 pixel377  int64
 378 pixel378  int64
 379 pixel379  int64
 380 pixel380  int64
 381 pixel381  int64
 382 pixel382  int64
 383 pixel383  int64
 384 pixel384  int64
 385 pixel385  int64
 386 pixel386  int64
 387 pixel387  int64
 388 pixel388  int64
 389 pixel389  int64
 390 pixel390  int64
 391 pixel391  int64
 392 pixel392  int64
 393 pixel393  int64
 394 pixel394  int64
 395 pixel395  int64
 396 pixel396  int64
 397 pixel397  int64
 398 pixel398  int64
 399 pixel399  int64
 400 pixel400  int64
 401 pixel401  int64
 402 pixel402  int64
 403 pixel403  int64
 404 pixel404  int64
 405 pixel405  int64
 406 pixel406  int64
 407 pixel407  int64
 408 pixel408  int64
 409 pixel409  int64
 410 pixel410  int64
 411 pixel411  int64
 412 pixel412  int64
 413 pixel413  int64
 414 pixel414  int64
 415 pixel415  int64
 416 pixel416  int64
 417 pixel417  int64
 418 pixel418  int64
 419 pixel419  int64
 420 pixel420  int64
 421 pixel421  int64
 422 pixel422  int64
 423 pixel423  int64
 424 pixel424  int64
 425 pixel425  int64
 426 pixel426  int64
 427 pixel427  int64
 428 pixel428  int64
 429 pixel429  int64
 430 pixel430  int64
 431 pixel431  int64
 432 pixel432  int64
 433 pixel433  int64
 434 pixel434  int64
 435 pixel435  int64
 436 pixel436  int64
 437 pixel437  int64
 438 pixel438  int64
 439 pixel439  int64
 440 pixel440  int64
 441 pixel441  int64
 442 pixel442  int64
 443 pixel443  int64
 444 pixel444  int64
 445 pixel445  int64
 446 pixel446  int64
 447 pixel447  int64
 448 pixel448  int64
 449 pixel449  int64
 450 pixel450  int64
 451 pixel451  int64
 452 pixel452  int64
 453 pixel453  int64
 454 pixel454  int64
 455 pixel455  int64
 456 pixel456  int64
 457 pixel457  int64
 458 pixel458  int64
 459 pixel459  int64
 460 pixel460  int64
 461 pixel461  int64
 462 pixel462  int64
 463 pixel463  int64
 464 pixel464  int64
 465 pixel465  int64
 466 pixel466  int64
 467 pixel467  int64
 468 pixel468  int64
 469 pixel469  int64
 470 pixel470  int64
 471 pixel471  int64
 472 pixel472  int64
 473 pixel473  int64
 474 pixel474  int64
 475 pixel475  int64
 476 pixel476  int64
 477 pixel477  int64
 478 pixel478  int64
 479 pixel479  int64
 480 pixel480  int64
 481 pixel481  int64
 482 pixel482  int64
 483 pixel483  int64
 484 pixel484  int64
 485 pixel485  int64
 486 pixel486  int64
 487 pixel487  int64
 488 pixel488  int64
 489 pixel489  int64
 490 pixel490  int64
 491 pixel491  int64
 492 pixel492  int64
 493 pixel493  int64
 494 pixel494  int64
 495 pixel495  int64
 496 pixel496  int64
 497 pixel497  int64
 498 pixel498  int64
 499 pixel499  int64
 500 pixel500  int64
 501 pixel501  int64
 502 pixel502  int64
 503 pixel503  int64
 504 pixel504  int64
 505 pixel505  int64
 506 pixel506  int64
 507 pixel507  int64
 508 pixel508  int64
 509 pixel509  int64
 510 pixel510  int64
 511 pixel511  int64
 512 pixel512  int64
 513 pixel513  int64
 514 pixel514  int64
 515 pixel515  int64
 516 pixel516  int64
 517 pixel517  int64
 518 pixel518  int64
 519 pixel519  int64
 520 pixel520  int64
 521 pixel521  int64
 522 pixel522  int64
 523 pixel523  int64
 524 pixel524  int64
 525 pixel525  int64
 526 pixel526  int64
 527 pixel527  int64
 528 pixel528  int64
 529 pixel529  int64
 530 pixel530  int64
 531 pixel531  int64
 532 pixel532  int64
 533 pixel533  int64
 534 pixel534  int64
 535 pixel535  int64
 536 pixel536  int64
 537 pixel537  int64
 538 pixel538  int64
 539 pixel539  int64
 540 pixel540  int64
 541 pixel541  int64
 542 pixel542  int64
 543 pixel543  int64
 544 pixel544  int64
 545 pixel545  int64
 546 pixel546  int64
 547 pixel547  int64
 548 pixel548  int64
 549 pixel549  int64
 550 pixel550  int64
 551 pixel551  int64
 552 pixel552  int64
 553 pixel553  int64
 554 pixel554  int64
 555 pixel555  int64
 556 pixel556  int64
 557 pixel557  int64
 558 pixel558  int64
 559 pixel559  int64
 560 pixel560  int64
 561 pixel561  int64
 562 pixel562  int64
 563 pixel563  int64
 564 pixel564  int64
 565 pixel565  int64
 566 pixel566  int64
 567 pixel567  int64
 568 pixel568  int64
 569 pixel569  int64
 570 pixel570  int64
 571 pixel571  int64
 572 pixel572  int64
 573 pixel573  int64
 574 pixel574  int64
 575 pixel575  int64
 576 pixel576  int64
 577 pixel577  int64
 578 pixel578  int64
 579 pixel579  int64
 580 pixel580  int64
 581 pixel581  int64
 582 pixel582  int64
 583 pixel583  int64
 584 pixel584  int64
 585 pixel585  int64
 586 pixel586  int64
 587 pixel587  int64
 588 pixel588  int64
 589 pixel589  int64
 590 pixel590  int64
 591 pixel591  int64
 592 pixel592  int64
 593 pixel593  int64
 594 pixel594  int64
 595 pixel595  int64
 596 pixel596  int64
 597 pixel597  int64
 598 pixel598  int64
 599 pixel599  int64
 600 pixel600  int64
 601 pixel601  int64
 602 pixel602  int64
 603 pixel603  int64
 604 pixel604  int64
 605 pixel605  int64
 606 pixel606  int64
 607 pixel607  int64
 608 pixel608  int64
 609 pixel609  int64
 610 pixel610  int64
 611 pixel611  int64
 612 pixel612  int64
 613 pixel613  int64
 614 pixel614  int64
 615 pixel615  int64
 616 pixel616  int64
 617 pixel617  int64
 618 pixel618  int64
 619 pixel619  int64
 620 pixel620  int64
 621 pixel621  int64
 622 pixel622  int64
 623 pixel623  int64
 624 pixel624  int64
 625 pixel625  int64
 626 pixel626  int64
 627 pixel627  int64
 628 pixel628  int64
 629 pixel629  int64
 630 pixel630  int64
 631 pixel631  int64
 632 pixel632  int64
 633 pixel633  int64
 634 pixel634  int64
 635 pixel635  int64
 636 pixel636  int64
 637 pixel637  int64
 638 pixel638  int64
 639 pixel639  int64
 640 pixel640  int64
 641 pixel641  int64
 642 pixel642  int64
 643 pixel643  int64
 644 pixel644  int64
 645 pixel645  int64
 646 pixel646  int64
 647 pixel647  int64
 648 pixel648  int64
 649 pixel649  int64
 650 pixel650  int64
 651 pixel651  int64
 652 pixel652  int64
 653 pixel653  int64
 654 pixel654  int64
 655 pixel655  int64
 656 pixel656  int64
 657 pixel657  int64
 658 pixel658  int64
 659 pixel659  int64
 660 pixel660  int64
 661 pixel661  int64
 662 pixel662  int64
 663 pixel663  int64
 664 pixel664  int64
 665 pixel665  int64
 666 pixel666  int64
 667 pixel667  int64
 668 pixel668  int64
 669 pixel669  int64
 670 pixel670  int64
 671 pixel671  int64
 672 pixel672  int64
 673 pixel673  int64
 674 pixel674  int64
 675 pixel675  int64
 676 pixel676  int64
 677 pixel677  int64
 678 pixel678  int64
 679 pixel679  int64
 680 pixel680  int64
 681 pixel681  int64
 682 pixel682  int64
 683 pixel683  int64
 684 pixel684  int64
 685 pixel685  int64
 686 pixel686  int64
 687 pixel687  int64
 688 pixel688  int64
 689 pixel689  int64
 690 pixel690  int64
 691 pixel691  int64
 692 pixel692  int64
 693 pixel693  int64
 694 pixel694  int64
 695 pixel695  int64
 696 pixel696  int64
 697 pixel697  int64
 698 pixel698  int64
 699 pixel699  int64
 700 pixel700  int64
 701 pixel701  int64
 702 pixel702  int64
 703 pixel703  int64
 704 pixel704  int64
 705 pixel705  int64
 706 pixel706  int64
 707 pixel707  int64
 708 pixel708  int64
 709 pixel709  int64
 710 pixel710  int64
 711 pixel711  int64
 712 pixel712  int64
 713 pixel713  int64
 714 pixel714  int64
 715 pixel715  int64
 716 pixel716  int64
 717 pixel717  int64
 718 pixel718  int64
 719 pixel719  int64
 720 pixel720  int64
 721 pixel721  int64
 722 pixel722  int64
 723 pixel723  int64
 724 pixel724  int64
 725 pixel725  int64
 726 pixel726  int64
 727 pixel727  int64
 728 pixel728  int64
 729 pixel729  int64
 730 pixel730  int64
 731 pixel731  int64
 732 pixel732  int64
 733 pixel733  int64
 734 pixel734  int64
 735 pixel735  int64
 736 pixel736  int64
 737 pixel737  int64
 738 pixel738  int64
 739 pixel739  int64
 740 pixel740  int64
 741 pixel741  int64
 742 pixel742  int64
 743 pixel743  int64
 744 pixel744  int64
 745 pixel745  int64
 746 pixel746  int64
 747 pixel747  int64
 748 pixel748  int64
 749 pixel749  int64
 750 pixel750  int64
 751 pixel751  int64
 752 pixel752  int64
 753 pixel753  int64
 754 pixel754  int64
 755 pixel755  int64
 756 pixel756  int64
 757 pixel757  int64
 758 pixel758  int64
 759 pixel759  int64
 760 pixel760  int64
 761 pixel761  int64
 762 pixel762  int64
 763 pixel763  int64
 764 pixel764  int64
 765 pixel765  int64
 766 pixel766  int64
 767 pixel767  int64
 768 pixel768  int64
 769 pixel769  int64
 770 pixel770  int64
 771 pixel771  int64
 772 pixel772  int64
 773 pixel773  int64
 774 pixel774  int64
 775 pixel775  int64
 776 pixel776  int64
 777 pixel777  int64
 778 pixel778  int64
 779 pixel779  int64
 780 pixel780  int64
 781 pixel781  int64
 782 pixel782  int64
 783 pixel783  int64
 784 pixel784  int64
dtypes: int64(785)
memory usage: 164.4 MB
In [13]:
Xy_original.describe()
Out[13]:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19 pixel20 pixel21 pixel22 pixel23 pixel24 pixel25 pixel26 pixel27 pixel28 pixel29 pixel30 pixel31 pixel32 pixel33 pixel34 pixel35 pixel36 pixel37 pixel38 pixel39 ... pixel745 pixel746 pixel747 pixel748 pixel749 pixel750 pixel751 pixel752 pixel753 pixel754 pixel755 pixel756 pixel757 pixel758 pixel759 pixel760 pixel761 pixel762 pixel763 pixel764 pixel765 pixel766 pixel767 pixel768 pixel769 pixel770 pixel771 pixel772 pixel773 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
count 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 ... 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000
mean 12.318813 145.419377 148.500273 151.247714 153.546531 156.210891 158.411255 160.472154 162.339683 163.954799 165.533673 166.685522 167.811983 168.495647 169.310872 169.956948 170.460463 170.716518 170.872701 170.808887 170.481442 169.979749 169.264506 168.144127 166.936660 165.765944 163.620725 161.933600 161.349117 147.146858 150.284502 152.941978 155.415043 158.068986 160.229576 162.345802 164.291167 165.736332 166.991732 168.503187 ... 131.949918 134.583755 139.361974 148.115862 155.095465 160.346858 163.915607 165.364414 165.911273 165.182080 164.407977 163.488254 143.407758 144.189474 145.711637 147.660718 149.019414 148.670843 148.185212 147.298926 146.286323 144.027062 142.966017 139.769550 137.071572 135.277181 131.922783 130.232235 132.046367 135.289237 141.104863 147.495611 153.325806 159.125332 161.969259 162.736696 162.906137 161.966454 161.137898 159.824731
std 7.287552 41.358555 39.942152 39.056286 38.595247 37.111165 36.125579 35.016392 33.661998 32.651607 31.279244 30.558445 29.771007 29.329251 28.620248 27.961255 27.053544 26.763535 26.307419 26.088459 26.475963 26.940885 27.871515 29.368386 30.906718 31.902723 34.303747 35.991306 36.571064 41.555429 40.094304 39.427215 38.686176 37.242459 36.373576 35.242915 33.899171 32.759395 31.656140 30.833853 ... 57.586028 60.373218 63.301768 63.300608 63.511566 62.551522 61.833119 62.210750 62.163516 61.958245 61.950177 62.403709 54.406116 53.658786 54.710497 54.613724 54.351761 55.233710 55.904715 57.155523 57.965677 57.393379 56.689603 56.118823 54.680900 55.278778 56.384076 58.111783 59.204300 62.553694 63.751194 65.512894 64.427412 63.708507 63.738316 63.444008 63.509210 63.298721 63.610415 64.396846
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 17.000000 23.000000 28.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.000000 121.000000 126.000000 130.000000 133.000000 137.000000 140.000000 142.000000 144.000000 146.000000 148.000000 150.000000 151.000000 152.000000 153.000000 153.000000 154.000000 154.000000 154.000000 155.000000 154.000000 154.000000 154.000000 153.000000 152.000000 151.000000 149.000000 147.000000 147.000000 123.000000 128.000000 132.000000 135.000000 139.000000 141.000000 144.000000 146.000000 148.000000 149.000000 151.000000 ... 91.000000 90.000000 90.000000 98.000000 104.000000 115.000000 127.000000 132.000000 137.000000 138.000000 140.000000 138.000000 104.000000 104.000000 105.000000 106.000000 107.000000 107.000000 106.000000 106.000000 105.000000 104.000000 104.000000 102.000000 102.000000 99.000000 94.000000 90.000000 90.000000 89.000000 92.000000 96.000000 103.000000 112.000000 120.000000 125.000000 128.000000 128.000000 128.000000 125.500000
50% 13.000000 150.000000 153.000000 156.000000 158.000000 160.000000 162.000000 164.000000 165.000000 166.000000 167.000000 168.000000 169.000000 170.000000 170.000000 171.000000 171.000000 171.000000 171.000000 171.000000 170.000000 170.000000 170.000000 169.000000 168.000000 168.000000 167.000000 166.000000 165.000000 152.000000 155.000000 157.000000 160.000000 162.000000 164.000000 166.000000 167.000000 168.000000 169.000000 170.000000 ... 127.000000 130.000000 139.000000 161.000000 175.000000 181.000000 184.000000 185.000000 186.000000 185.000000 184.000000 184.000000 145.000000 146.000000 149.000000 151.000000 153.000000 153.000000 154.000000 152.000000 153.000000 152.000000 150.000000 146.000000 141.000000 137.000000 131.000000 128.000000 128.000000 132.000000 144.000000 162.000000 172.000000 180.000000 183.000000 184.000000 184.000000 182.000000 182.000000 182.000000
75% 19.000000 174.000000 176.000000 178.000000 179.000000 181.000000 182.000000 183.000000 184.000000 185.000000 186.000000 186.000000 187.000000 187.000000 187.000000 188.000000 188.000000 188.000000 187.000000 187.000000 187.000000 187.000000 186.000000 186.000000 186.000000 185.000000 184.000000 184.000000 183.000000 176.000000 178.000000 180.000000 181.000000 183.000000 184.000000 185.000000 186.000000 187.000000 187.000000 188.000000 ... 177.000000 186.000000 195.000000 201.000000 205.000000 207.000000 207.000000 208.000000 208.000000 207.000000 206.000000 205.000000 187.000000 187.000000 190.000000 192.000000 193.000000 193.000000 192.000000 193.000000 192.000000 188.000000 185.000000 181.000000 176.000000 174.000000 172.000000 173.000000 180.000000 189.000000 196.000000 202.000000 205.000000 207.000000 208.000000 207.000000 207.000000 206.000000 204.000000 204.000000
max 24.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 ... 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000

8 rows × 785 columns

In [14]:
Xy_original.isnull().sum()
Out[14]:
label       0
pixel1      0
pixel2      0
pixel3      0
pixel4      0
           ..
pixel780    0
pixel781    0
pixel782    0
pixel783    0
pixel784    0
Length: 785, dtype: int64
In [15]:
print('Total number of NaN in the dataframe: ', Xy_original.isnull().sum().sum())
Total number of NaN in the dataframe:  0

1.d) Data Cleaning

In [16]:
# Standardize the class column to the name of targetVar if required
Xy_original = Xy_original.rename(columns={'label': 'targetVar'})

1.e) Splitting Data into Sets

In [17]:
# Use variable totCol to hold the number of columns in the dataframe
totCol = len(Xy_original.columns)

# Set up variable totAttr for the total number of attribute columns
totAttr = totCol-1

# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol <> 1) and (targetCol <> totCol), be aware when slicing up the dataframes for visualization
targetCol = 1
In [18]:
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations

if targetCol == totCol:
    X_original = Xy_original.iloc[:,0:totAttr]
    y_original = Xy_original.iloc[:,totAttr]
else:
    X_original = Xy_original.iloc[:,1:totCol]
    y_original = Xy_original.iloc[:,0]

print("Xy_original.shape: {} X_original.shape: {} y_original.shape: {}".format(Xy_original.shape, X_original.shape, y_original.shape))
Xy_original.shape: (27455, 785) X_original.shape: (27455, 784) y_original.shape: (27455,)
In [19]:
# Split the data further into training and validation datasets
X_train_df = X_original.copy()
y_train_df = y_original.copy()
print("X_train_df.shape: {} y_train_df.shape: {}".format(X_train_df.shape, y_train_df.shape))
X_train_df.shape: (27455, 784) y_train_df.shape: (27455,)
In [20]:
if notifyStatus: status_notify("Task 1 - Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 2 - Summarize and Visualize Data

In [21]:
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
In [22]:
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol = 4
if totAttr % dispCol == 0 :
    dispRow = totAttr // dispCol
else :
    dispRow = (totAttr // dispCol) + 1
    
# Set figure width to display the data visualization plots
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = dispCol*4
fig_size[1] = dispRow*4
plt.rcParams["figure.figsize"] = fig_size
In [23]:
# Histograms for each attribute
X_train_df.hist(layout=(dispRow,dispCol))
plt.show()
In [24]:
# Box and Whisker plot for each attribute
X_train_df.plot(kind='box', subplots=True, layout=(dispRow,dispCol))
plt.show()
In [25]:
# Correlation matrix
fig = plt.figure(figsize=(16,12))
ax = fig.add_subplot(111)
correlations = X_train_df.corr(method='pearson')
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
plt.show()
In [26]:
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 3 - Pre-process Data

In [27]:
if notifyStatus: status_notify("Task 3 - Pre-process Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

3.a) Feature Scaling

In [28]:
# Compose pipeline for the numerical and categorical features
numeric_columns = X_train_df.select_dtypes(include=['int','float']).columns
numeric_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', preprocessing.StandardScaler())
])
categorical_columns = X_train_df.select_dtypes(include=['object','category']).columns
categorical_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),
    ('onehot', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_columns)
#     ('cat', categorical_transformer, categorical_columns)
])

print("Number of numerical columns:", len(numeric_columns))
print("Number of categorical columns:", len(categorical_columns))
print("Total number of columns in the dataframe:", X_train_df.shape[1])
Number of numerical columns: 784
Number of categorical columns: 0
Total number of columns in the dataframe: 784
In [29]:
# Apply the pre-processing pipeline to the training dataset
X_train = preprocessor.fit_transform(X_train_df)
print("Transformed X_train.shape:", X_train.shape)
print(X_train)
Transformed X_train.shape: (27455, 784)
[[-0.92895113 -0.76362507 -0.62085156 ...  0.66406257  0.65811336
   0.65493948]
 [ 0.23165213  0.2128048   0.12168008 ... -0.93157535 -0.41091344
  -0.16809718]
 [ 1.00538763  0.98894136  0.94102534 ...  0.52187701  0.51662452
   0.54623652]
 ...
 [ 0.69105758  0.63842807  0.58256179 ...  0.63246578  0.61095041
   0.62388149]
 [ 0.76359529  0.81368471  0.83860719 ... -1.54771277 -1.16552059
  -1.03772083]
 [ 0.81195376  0.78864805  0.73618903 ...  0.67986097  0.75243925
   0.8568164 ]]

3.b) Training Data Balancing

In [30]:
# Not applicable for this iteration of the project

3.c) Feature Selection

In [31]:
# Not applicable for this iteration of the project

3.d) Display the Final Datasets for Model-Building

In [32]:
# Apply the encoder to the training dataset
class_encoder = preprocessing.LabelEncoder()
y_train = class_encoder.fit_transform(y_train_df)
print("X_train.shape: {} y_train.shape: {}".format(X_train.shape, y_train.shape))
X_train.shape: (27455, 784) y_train.shape: (27455,)
In [33]:
if notifyStatus: status_notify("Task 3 - Pre-process Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 4 - Train and Evaluate Models

In [34]:
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

4.a) Set test options and evaluation metric

In [35]:
# Set up Algorithms Spot-Checking Array
startTimeModule = datetime.now()
# train_models = [('XGB', XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes))]
train_models = [('XGB', XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist'))]
In [36]:
# Generate model in turn
for name, model in train_models:
	if notifyStatus: status_notify("Algorithm "+name+" modeling has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
	startTimeModule = datetime.now()
	kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
	cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=n_jobs, verbose=1)
	print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
	print(model)
	print ('Model training time:', (datetime.now() - startTimeModule), '\n')
	if notifyStatus: status_notify("Algorithm "+name+" modeling completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
XGB: 0.954107 (0.001617)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, num_class=3, objective='multi:softmax',
              random_state=888, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=None, subsample=1, tree_method='gpu_hist',
              verbosity=1)
Model training time: 0:00:40.267035 

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   40.3s finished

4.b) Algorithm Tuning

In [37]:
# Set up the comparison array
tune_results = []
tune_model_names = []
In [38]:
# Tuning XGBoost n_estimators, max_depth, and min_child_weight parameters
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm tuning iteration #1 has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# tune_model1 = XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes)
tune_model1 = XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist')
tune_model_names.append('XGB_1')
paramGrid1 = dict(n_estimators=range(500,901,100), max_depth=np.array([3,6]), min_child_weight=np.array([1,2]))

kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid1 = GridSearchCV(estimator=tune_model1, param_grid=paramGrid1, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result1 = grid1.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
tune_results.append(grid_result1.cv_results_['mean_test_score'])
means = grid_result1.cv_results_['mean_test_score']
stds = grid_result1.cv_results_['std_test_score']
params = grid_result1.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm tuning iteration #1 completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 87.0min finished
Best: 0.996176 using {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 800}
0.995666 (0.000371) with: {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 500}
0.995884 (0.000425) with: {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 600}
0.996103 (0.000668) with: {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 700}
0.996176 (0.000950) with: {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 800}
0.996176 (0.000991) with: {'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 900}
0.995411 (0.000313) with: {'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 500}
0.995702 (0.000497) with: {'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 600}
0.995884 (0.000627) with: {'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 700}
0.995957 (0.000603) with: {'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 800}
0.995993 (0.000609) with: {'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 900}
0.995010 (0.000455) with: {'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 500}
0.995046 (0.000480) with: {'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 600}
0.995046 (0.000422) with: {'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 700}
0.995083 (0.000415) with: {'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 800}
0.995156 (0.000455) with: {'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 900}
0.994646 (0.000811) with: {'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 500}
0.994719 (0.000870) with: {'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 600}
0.994791 (0.000882) with: {'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 700}
0.994828 (0.000882) with: {'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 800}
0.994864 (0.000714) with: {'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 900}
Model training time: 1:28:00.341060
In [39]:
# Tuning XGBoost subsample and colsample_bytree parameters
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm tuning iteration #2 has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

# tune_model2 = XGBClassifier(n_estimators=100, max_depth=6, min_child_weight=1, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes)
tune_model2 = XGBClassifier(n_estimators=800, max_depth=3, min_child_weight=1, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist')
tune_model_names.append('XGB_2')
paramGrid2 = dict(subsample=np.array([0.7,0.8,0.9,1.0]), colsample_bytree=np.array([0.7,0.8,0.9,1.0]))

kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid2 = GridSearchCV(estimator=tune_model2, param_grid=paramGrid2, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result2 = grid2.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
tune_results.append(grid_result2.cv_results_['mean_test_score'])
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm tuning iteration #2 completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed: 75.4min finished
Best: 0.996831 using {'colsample_bytree': 0.8, 'subsample': 0.7}
0.996321 (0.000759) with: {'colsample_bytree': 0.7, 'subsample': 0.7}
0.996758 (0.000685) with: {'colsample_bytree': 0.7, 'subsample': 0.8}
0.996503 (0.000776) with: {'colsample_bytree': 0.7, 'subsample': 0.9}
0.996066 (0.000668) with: {'colsample_bytree': 0.7, 'subsample': 1.0}
0.996831 (0.000734) with: {'colsample_bytree': 0.8, 'subsample': 0.7}
0.996649 (0.000743) with: {'colsample_bytree': 0.8, 'subsample': 0.8}
0.996576 (0.001020) with: {'colsample_bytree': 0.8, 'subsample': 0.9}
0.996358 (0.000885) with: {'colsample_bytree': 0.8, 'subsample': 1.0}
0.996212 (0.000437) with: {'colsample_bytree': 0.9, 'subsample': 0.7}
0.996722 (0.000764) with: {'colsample_bytree': 0.9, 'subsample': 0.8}
0.996576 (0.000834) with: {'colsample_bytree': 0.9, 'subsample': 0.9}
0.996285 (0.000933) with: {'colsample_bytree': 0.9, 'subsample': 1.0}
0.996503 (0.000818) with: {'colsample_bytree': 1.0, 'subsample': 0.7}
0.996758 (0.000960) with: {'colsample_bytree': 1.0, 'subsample': 0.8}
0.996467 (0.000743) with: {'colsample_bytree': 1.0, 'subsample': 0.9}
0.996176 (0.000950) with: {'colsample_bytree': 1.0, 'subsample': 1.0}
Model training time: 1:16:28.476855

4.c) Compare Algorithms After Tuning

In [40]:
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Post Tuning')
ax = fig.add_subplot(111)
plt.boxplot(tune_results)
ax.set_xticklabels(tune_model_names)
plt.show()
In [41]:
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 5 - Finalize Model and Present Analysis

In [42]:
if notifyStatus: status_notify("Task 5 - Finalize Model and Present Analysis has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
In [43]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_test.csv'
Xy_test = pd.read_csv(dataset_path, sep=',')

# Take a peek at the dataframe after import
Xy_test.head()
Out[43]:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19 pixel20 pixel21 pixel22 pixel23 pixel24 pixel25 pixel26 pixel27 pixel28 pixel29 pixel30 pixel31 pixel32 pixel33 pixel34 pixel35 pixel36 pixel37 pixel38 pixel39 ... pixel745 pixel746 pixel747 pixel748 pixel749 pixel750 pixel751 pixel752 pixel753 pixel754 pixel755 pixel756 pixel757 pixel758 pixel759 pixel760 pixel761 pixel762 pixel763 pixel764 pixel765 pixel766 pixel767 pixel768 pixel769 pixel770 pixel771 pixel772 pixel773 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 6 149 149 150 150 150 151 151 150 151 152 152 152 152 152 153 153 151 152 152 153 152 152 151 151 150 150 150 149 150 150 150 152 152 151 152 152 152 152 152 ... 131 134 144 147 125 87 87 103 107 110 116 113 75 74 74 74 76 74 82 134 168 155 146 137 145 146 149 135 124 125 138 148 127 89 82 96 106 112 120 107
1 5 126 128 131 132 133 134 135 135 136 138 137 137 138 138 139 137 142 140 138 139 137 137 136 135 134 133 134 132 129 132 134 135 135 137 139 139 139 140 141 ... 114 112 89 48 133 194 182 185 184 184 182 181 172 174 177 178 178 179 181 183 187 175 165 154 118 107 100 75 96 83 47 104 194 183 186 184 184 184 182 180
2 10 85 88 92 96 105 123 135 143 147 152 157 163 168 171 182 172 175 185 183 184 185 185 185 183 183 182 181 178 86 88 93 96 108 125 137 145 149 154 160 ... 145 123 78 162 239 227 229 226 226 225 224 222 89 91 94 111 136 154 167 184 125 3 166 225 195 188 172 185 161 122 68 166 242 227 230 227 226 225 224 222
3 0 203 205 207 206 207 209 210 209 210 209 208 207 207 209 208 210 210 207 209 209 208 209 210 209 207 208 209 207 206 208 209 208 208 210 211 210 211 209 209 ... 85 80 84 151 238 255 255 250 237 245 250 232 103 101 102 103 95 208 231 227 209 190 179 182 152 150 159 119 83 63 154 248 247 248 253 236 230 240 253 255
4 3 188 191 193 195 199 201 202 203 203 203 204 204 204 203 202 198 216 217 135 181 200 195 194 193 190 189 187 185 190 194 196 197 200 202 204 206 207 207 206 ... 93 52 24 53 63 33 41 51 48 45 49 55 149 150 150 148 147 151 124 82 84 81 69 81 111 103 84 75 53 28 26 40 64 48 29 46 49 46 46 53

5 rows × 785 columns

In [44]:
# Standardize the class column to the name of targetVar if required
Xy_test = Xy_test.rename(columns={'label': 'targetVar'})
In [45]:
X_test_df = Xy_test.iloc[:,1:totCol]
y_test_df = Xy_test.iloc[:,0]
print("Xy_test.shape: {} X_test_df.shape: {} y_test_df.shape: {}".format(Xy_test.shape, X_test_df.shape, y_test_df.shape))
Xy_test.shape: (7172, 785) X_test_df.shape: (7172, 784) y_test_df.shape: (7172,)
In [46]:
# Apply the pre-processing pipeline to the test dataset
X_test = preprocessor.transform(X_test_df)
print("Transformed X_test.shape:", X_test.shape)
print(X_test)
Transformed X_test.shape: (7172, 784)
[[ 0.08657672  0.01251149 -0.03194715 ... -0.78938979 -0.64672817
  -0.82031492]
 [-0.46954568 -0.51325843 -0.5184334  ...  0.34809467  0.32797273
   0.31330162]
 [-1.4608943  -1.51472496 -1.51701044 ...  0.99582887  0.98825399
   0.96551935]
 ...
 [ 1.07792534  1.06405134  0.99223442 ...  0.77465134  0.75243925
   0.74811344]
 [ 1.34389692  1.41456463  1.45311613 ... -1.50031758 -1.43277729
  -1.50359064]
 [ 0.66687835  0.63842807  0.55695725 ...  0.52187701  0.50090354
   0.49964954]]
In [47]:
# Apply the encoder to the test dataset
y_test = class_encoder.transform(y_test_df)
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))
X_test.shape: (7172, 784) y_test.shape: (7172,)
In [48]:
# test_model = XGBClassifier(n_estimators=100, max_depth=6, min_child_weight=1, colsample_bytree=0.8, subsample=0.8, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax')
test_model = XGBClassifier(n_estimators=800, max_depth=3, min_child_weight=1, colsample_bytree=0.8, subsample=0.7, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', tree_method='gpu_hist')
test_model.fit(X_train, y_train)
print(test_model)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=800, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=888,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.7, tree_method='gpu_hist', verbosity=1)
In [49]:
test_predictions = test_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, test_predictions))
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
Accuracy Score: 0.7893195761293921
[[330   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
    0   0   0   0   0   0]
 [  0 408   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0
    0   0   0   4   0   0]
 [  0   0 289   0   0  20   0   0   0   0   1   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0 238   1   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   6   0]
 [  2   0   0   0 466   0   0   0   0   0   0   0   0   0   0   0   0  30
    0   0   0   0   0   0]
 [  0   0  14   0   0 229   0   0   0   0   0   0   0   1   0   0   0   0
    3   0   0   0   0   0]
 [  0   0   0   0   0   0 282   4   0   0   0   0   6   0  17   3   0   0
   36   0   0   0   0   0]
 [  0   0   0   0   0   0  33 374   0   0   0   0   0   0   0   1   0   0
   20   8   0   0   0   0]
 [  0   0   0   0   0   0   0   0 231   1   0   0   1   0   0   0   0   8
   10   0   3   0   0  34]
 [  0   0   0   0   0   0   0   0   0 244   2   0   0   0  10   0  28  18
    0   5   6  18   0   0]
 [  0   0   0   0   0   0   0   0   0   0 209   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0  34   0   0   2   1   0   0 270  22   0   0   9   0  53
    0   0   0   0   0   3]
 [ 39   0   0  16  16   0   0   0   0   0   0  38 135   0   1   5   0  21
   20   0   0   0   0   0]
 [  0   0  16   0   3  14   3   0   0   0   0   0   2 159   0  31   0   8
    6   0   0   0   4   0]
 [  0   0   0   0   0   0   0   2   0   0   0   0   0   0 338   2   0   0
    0   0   0   5   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 159   0   5
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  17   0   1   0   0   0  64   0
    2  57   2   1   0   0]
 [  0   1   0   0  21   0   0   0   3   0   0  32   1   0   0   2   0 184
    0   0   0   1   0   1]
 [  0   0   0   0   0   0   0   0   2   0  19   0   0   0   0   0   2   0
  173   0   1   5  43   3]
 [  0   3   0   0   0   0   0   0   0  25   0   0   0   0   0   0  62   0
    0 161  12   0   0   3]
 [  0   0   0   1   0  12   0   0   0   2   0   0   0   0   0   0  36   1
    0   9 216  60   9   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   2   0
    0  44  35 125   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0
   15   0   4  66 179   0]
 [  0   0   0   0   0   0  14   0  14   0   3   0   0   0   0   0  34  24
   28   0  10   7   0 198]]
              precision    recall  f1-score   support

           0       0.89      1.00      0.94       331
           1       0.99      0.94      0.97       432
           2       0.91      0.93      0.92       310
           3       0.93      0.97      0.95       245
           4       0.86      0.94      0.90       498
           5       0.83      0.93      0.88       247
           6       0.85      0.81      0.83       348
           7       0.98      0.86      0.91       436
           8       0.92      0.80      0.86       288
           9       0.84      0.74      0.78       331
          10       0.83      1.00      0.91       209
          11       0.79      0.69      0.74       394
          12       0.80      0.46      0.59       291
          13       0.99      0.65      0.78       246
          14       0.92      0.97      0.95       347
          15       0.75      0.97      0.85       164
          16       0.28      0.44      0.34       144
          17       0.52      0.75      0.61       246
          18       0.55      0.70      0.62       248
          19       0.57      0.61      0.59       266
          20       0.75      0.62      0.68       346
          21       0.43      0.61      0.50       206
          22       0.74      0.67      0.70       267
          23       0.82      0.60      0.69       332

    accuracy                           0.79      7172
   macro avg       0.78      0.78      0.77      7172
weighted avg       0.81      0.79      0.79      7172

In [50]:
if notifyStatus: status_notify("Task 5 - Finalize Model and Present Analysis completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
In [51]:
print ('Total time for the script:',(datetime.now() - startTimeScript))
Total time for the script: 2:49:59.654305